{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "eVHN07TOeRge"
      },
      "source": [
        "# Semantic Search With Searchkit and Elasticsearch\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/searchkit/searchkit/blob/main/notebooks/semantic-search.ipynb)\n",
        "\n",
        "This notebook sets up an ingest pipeline and a search index to perform semantic search on a dataset of 5000 IMDb movies. We will use the sentence-transformers `all-MiniLM-L6-v2` embedding model.\n",
        "\n",
        "This is part of a guide on how to use Searchkit to build a semantic search application. You can find the full guide [here](https://searchkit.co/docs/guides/semantic-search/)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zdzl8tmZfr3y"
      },
      "source": [
        "## Install packages and import modules\n",
        "Before you start, install the required Python dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9NM_6fGFURcz"
      },
      "outputs": [],
      "source": [
        "# install packages\n",
        "!python3 -m pip install -qU sentence-transformers eland elasticsearch transformers\n",
        "\n",
        "# import modules\n",
        "import json\n",
        "import pandas as pd\n",
        "from elasticsearch import Elasticsearch\n",
        "from getpass import getpass\n",
        "from urllib.request import urlopen"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vKU9L8o2FodV"
      },
      "source": [
        "## Deploy MiniLM-L6-v2 model into Elasticsearch\n",
        "\n",
        "We use eland to import the model from the Hugging Face Hub into Elasticsearch and start it (the `--start` flag). We connect to Elasticsearch with a Cloud ID and an API key.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "vQkKn02j_FfJ"
      },
      "outputs": [],
      "source": [
        "API_KEY = getpass(\"Elastic deployment API Key: \")\n",
        "CLOUD_ID = getpass(\"Elastic deployment Cloud ID: \")\n",
        "!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $API_KEY --start"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MGfkUDWDMkc4"
      },
      "source": [
        "## Connect to Elasticsearch\n",
        "\n",
        "Create an Elasticsearch client instance with your deployment's `Cloud ID` and `API Key`. In this example, we reuse the `API_KEY` and `CLOUD_ID` values from the previous step."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XGi175RbJhVQ"
      },
      "outputs": [],
      "source": [
        "es = Elasticsearch(\n",
        "  cloud_id=CLOUD_ID,\n",
        "  api_key=API_KEY,\n",
        "  request_timeout=600\n",
        ")\n",
        "\n",
        "# should return cluster info if successfully connected\n",
        "es.info()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8FoZ5TBrIqOT"
      },
      "source": [
        "## Create an Ingest pipeline\n",
        "\n",
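        "We need to create a text embedding ingest pipeline that generates vector embeddings for the `Plot` field.\n",
        "\n",
        "The pipeline below defines an [inference](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) processor that runs the embedding model on each document. The `field_map` passes the document's `Plot` field to the model as `text_field`, and the result is written to `plot_embedding.predicted_value`."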
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "geY7WLh7Ky-k"
      },
      "outputs": [],
      "source": [
        "# ingest pipeline definition\n",
        "PIPELINE_ID=\"imdb-movies-pipeline\"\n",
        "\n",
        "es.ingest.put_pipeline(\n",
        "  id=PIPELINE_ID, \n",
        "  processors=[{\n",
        "    \"inference\": {\n",
        "      \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
        "      \"target_field\": \"plot_embedding\",\n",
        "      \"field_map\": {\n",
        "        \"Plot\": \"text_field\"\n",
        "      }\n",
        "    }\n",
        "  }]\n",
        ")"
      ]
    },
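    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before indexing, it can help to confirm that the pipeline actually produces an embedding. The cell below is a minimal sketch, assuming the `es` client and `PIPELINE_ID` from the cells above and a model deployment that is already started; the sample plot text is made up for illustration. It uses the simulate pipeline API, which runs the processors on the given document without indexing anything."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# dry-run the pipeline on one hypothetical document (nothing is indexed)\n",
        "simulated = es.ingest.simulate(\n",
        "  id=PIPELINE_ID,\n",
        "  docs=[{\"_source\": {\"Plot\": \"Toys come to life when their owner leaves the room.\"}}]\n",
        ")\n",
        "\n",
        "# the inference processor writes the vector under target_field, in predicted_value\n",
        "embedding = simulated[\"docs\"][0][\"doc\"][\"_source\"][\"plot_embedding\"][\"predicted_value\"]\n",
        "print(\"Embedding dimensions:\", len(embedding))"
      ]
    },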
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IW-GIlH2OxB4"
      },
      "source": [
        "## Create Index with mappings\n",
        "\n",
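        "We will now create an Elasticsearch index with the correct mapping before indexing documents. The `plot_embedding.predicted_value` field is mapped as a `dense_vector` with 384 dimensions, the output size of `all-MiniLM-L6-v2`, and the index uses the ingest pipeline as its default pipeline."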
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xAkc1OVcOxy3"
      },
      "outputs": [],
      "source": [
        "INDEX_NAME=\"imdb-movies-semantic-search\"\n",
        "\n",
        "SHOULD_DELETE_INDEX=True\n",
        "\n",
        "INDEX_MAPPING = {\n",
        "    \"properties\": {\n",
        "      \"plot_embedding\": {\n",
        "        \"properties\": {\n",
        "          \"predicted_value\": {\n",
        "            \"type\": \"dense_vector\",\n",
        "            \"dims\": 384,\n",
        "            \"index\": True,\n",
        "            \"similarity\": \"cosine\"\n",
        "          }\n",
        "        }\n",
        "      }\n",
        "    }\n",
        "  }\n",
        "\n",
        "INDEX_SETTINGS = {\n",
        "    \"index\": {\n",
        "      \"number_of_replicas\": \"1\",\n",
        "      \"number_of_shards\": \"1\",\n",
        "      \"default_pipeline\": PIPELINE_ID\n",
        "    }\n",
        "}\n",
        "\n",
        "# optionally delete an existing index before creating it\n",
        "if SHOULD_DELETE_INDEX and es.indices.exists(index=INDEX_NAME):\n",
        "  print(\"Deleting existing %s\" % INDEX_NAME)\n",
        "  es.indices.delete(index=INDEX_NAME, allow_no_indices=True, ignore_unavailable=True)\n",
        "\n",
        "print(\"Creating index %s\" % INDEX_NAME)\n",
        "es.indices.create(index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WOGsvnGveAoP"
      },
      "source": [
        "## Index data to elasticsearch index\n",
        "\n",
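        "Let's index the sample movies data. Because the index uses the ingest pipeline as its default pipeline, each document gets a plot embedding as it is indexed.\n",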
        "\n",
        "Note: Before we begin indexing, ensure you have [started your trained model deployment](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html)."
      ]
    },
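    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One way to check the deployment from the notebook is to ask the trained models stats API for its state. This is a sketch, assuming the `es` client from above and the model ID used in the ingest pipeline; `deployment_stats` is read defensively because it is only expected to be present once the model has been deployed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# look up the deployment state of the embedding model\n",
        "stats = es.ml.get_trained_models_stats(model_id=\"sentence-transformers__all-minilm-l6-v2\")\n",
        "\n",
        "for model in stats[\"trained_model_stats\"]:\n",
        "  state = model.get(\"deployment_stats\", {}).get(\"state\", \"not deployed\")\n",
        "  print(\"%s: %s\" % (model[\"model_id\"], state))"
      ]
    },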
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ree_NDh-eRgh"
      },
      "outputs": [],
      "source": [
        "url = \"https://raw.githubusercontent.com/searchkit/searchkit/main/sample-data/movies/movies.json\"\n",
        "response = urlopen(url)\n",
        "movies = json.loads(response.read())\n",
        "\n",
        "actions = []\n",
        "for movie in movies:\n",
        "    actions.append({\"index\": {\"_index\": INDEX_NAME}})\n",
        "    actions.append({\n",
        "        \"title\": movie[\"Title\"],\n",
        "        \"Plot\": movie[\"Plot\"],\n",
        "        \"Genre\": movie[\"genres\"],\n",
        "    })\n",
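        "\n",
        "# send all documents in one bulk request; the default pipeline adds the embeddings\n",
        "bulk_response = es.bulk(index=INDEX_NAME, operations=actions)\n",
        "print(\"Bulk request reported errors:\", bulk_response[\"errors\"])"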
      ]
    },
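    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To confirm that the documents landed in the index, a quick count can help. This is a sketch, assuming the `es` client and `INDEX_NAME` from above; it refreshes the index first so that newly indexed documents are visible to search and count requests."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# make recently indexed documents visible, then count them\n",
        "es.indices.refresh(index=INDEX_NAME)\n",
        "doc_count = es.count(index=INDEX_NAME)[\"count\"]\n",
        "print(\"Documents in %s: %d\" % (INDEX_NAME, doc_count))"
      ]
    },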
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xPPHg8K8T3wY"
      },
      "source": [
        "## Querying\n",
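        "The next step is to check that everything is working. We run a simple kNN search: `query_vector_builder` embeds the query text with the model `sentence-transformers__all-minilm-l6-v2`, and the resulting vector is compared against `plot_embedding.predicted_value`."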
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 566
        },
        "id": "c4G5V9wmU9C5",
        "outputId": "c8f0cc24-5713-4560-8a5d-c42da562a670"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>_index</th>\n",
              "      <th>_id</th>\n",
              "      <th>_score</th>\n",
              "      <th>_source.title</th>\n",
              "      <th>_source.Plot</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>ERERDooB48GkE85ltEPi</td>\n",
              "      <td>0.729052</td>\n",
              "      <td>Toy Story</td>\n",
              "      <td>A cowboy doll is profoundly threatened and jea...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>CxERDooB48GkE85ltEfi</td>\n",
              "      <td>0.729052</td>\n",
              "      <td>Toy Story</td>\n",
              "      <td>A cowboy doll is profoundly threatened and jea...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>BBERDooB48GkE85ltEvj</td>\n",
              "      <td>0.723086</td>\n",
              "      <td>How the Grinch Stole Christmas</td>\n",
              "      <td>Big budget remake of the classic cartoon about...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>ixERDooB48GkE85ltEvj</td>\n",
              "      <td>0.718766</td>\n",
              "      <td>Small Soldiers</td>\n",
              "      <td>When missile technology is used to enhance toy...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>jhERDooB48GkE85ltEPi</td>\n",
              "      <td>0.713659</td>\n",
              "      <td>Toy Story 2</td>\n",
              "      <td>When Woody is stolen by a toy collector, Buzz ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>hxERDooB48GkE85ltEfi</td>\n",
              "      <td>0.713659</td>\n",
              "      <td>Toy Story 2</td>\n",
              "      <td>When Woody is stolen by a toy collector, Buzz ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>TxERDooB48GkE85ltEbi</td>\n",
              "      <td>0.709669</td>\n",
              "      <td>Ponyo</td>\n",
              "      <td>An animated adventure about a five-year-old bo...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>RhERDooB48GkE85ltErj</td>\n",
              "      <td>0.709669</td>\n",
              "      <td>Ponyo</td>\n",
              "      <td>An animated adventure about a five-year-old bo...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>ZxERDooB48GkE85ltEbi</td>\n",
              "      <td>0.706249</td>\n",
              "      <td>The Passion of the Christ</td>\n",
              "      <td>A film detailing the final hours and crucifixi...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>imdb-movies-semantic-search</td>\n",
              "      <td>YRERDooB48GkE85ltErj</td>\n",
              "      <td>0.706249</td>\n",
              "      <td>The Passion of the Christ</td>\n",
              "      <td>A film detailing the final hours and crucifixi...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                        _index                   _id    _score  \\\n",
              "0  imdb-movies-semantic-search  ERERDooB48GkE85ltEPi  0.729052   \n",
              "1  imdb-movies-semantic-search  CxERDooB48GkE85ltEfi  0.729052   \n",
              "2  imdb-movies-semantic-search  BBERDooB48GkE85ltEvj  0.723086   \n",
              "3  imdb-movies-semantic-search  ixERDooB48GkE85ltEvj  0.718766   \n",
              "4  imdb-movies-semantic-search  jhERDooB48GkE85ltEPi  0.713659   \n",
              "5  imdb-movies-semantic-search  hxERDooB48GkE85ltEfi  0.713659   \n",
              "6  imdb-movies-semantic-search  TxERDooB48GkE85ltEbi  0.709669   \n",
              "7  imdb-movies-semantic-search  RhERDooB48GkE85ltErj  0.709669   \n",
              "8  imdb-movies-semantic-search  ZxERDooB48GkE85ltEbi  0.706249   \n",
              "9  imdb-movies-semantic-search  YRERDooB48GkE85ltErj  0.706249   \n",
              "\n",
              "                    _source.title  \\\n",
              "0                       Toy Story   \n",
              "1                       Toy Story   \n",
              "2  How the Grinch Stole Christmas   \n",
              "3                  Small Soldiers   \n",
              "4                     Toy Story 2   \n",
              "5                     Toy Story 2   \n",
              "6                           Ponyo   \n",
              "7                           Ponyo   \n",
              "8       The Passion of the Christ   \n",
              "9       The Passion of the Christ   \n",
              "\n",
              "                                        _source.Plot  \n",
              "0  A cowboy doll is profoundly threatened and jea...  \n",
              "1  A cowboy doll is profoundly threatened and jea...  \n",
              "2  Big budget remake of the classic cartoon about...  \n",
              "3  When missile technology is used to enhance toy...  \n",
              "4  When Woody is stolen by a toy collector, Buzz ...  \n",
              "5  When Woody is stolen by a toy collector, Buzz ...  \n",
              "6  An animated adventure about a five-year-old bo...  \n",
              "7  An animated adventure about a five-year-old bo...  \n",
              "8  A film detailing the final hours and crucifixi...  \n",
              "9  A film detailing the final hours and crucifixi...  "
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "query = {\n",
        "  \"field\": \"plot_embedding.predicted_value\",\n",
        "  \"k\": 10,\n",
        "  \"num_candidates\": 50,\n",
        "  \"query_vector_builder\": {\n",
        "    \"text_embedding\": {\n",
        "      \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
        "      \"model_text\": \"a film about toys coming alive\"\n",
        "    }\n",
        "  }\n",
        "}\n",
        "\n",
        "response = es.search(\n",
        "    index=INDEX_NAME,\n",
        "    source=[\"id\", \"title\", \"Plot\"],\n",
        "    knn=query)\n",
        "\n",
        "\n",
        "# flatten the hits into a DataFrame, one row per matching document\n",
        "results = pd.json_normalize(response.body['hits']['hits'])\n",
        "\n",
        "# shows the result\n",
        "results\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3.11.3 64-bit",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.3"
    },
    "vscode": {
      "interpreter": {
        "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
